Detection of rare sequence variants, methods and compositions therefor

ABSTRACT

The present disclosure encompasses methods of error corrected sequencing (ECS) that enable detection of very rare mutations well below the error rate of conventional next generation sequencing (NGS). Further, the methods disclosed herein enable multiplex targeting of genomic DNA.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a divisional of U.S. application Ser. No. 15/545,437, filed Jul. 21, 2017, which claims the benefit of International Patent Application No. PCT/US16/14559, filed Jan. 22, 2016, which claims the benefit of U.S. Provisional Application No. 62/106,967, filed Jan. 23, 2015, the disclosure of which is hereby incorporated by reference in its entirety.

GOVERNMENTAL RIGHTS

This invention was made with government support under 1K08CA140720-01A1 awarded by the NIH. The government has certain rights in the invention.

FIELD OF THE INVENTION

The present disclosure encompasses methods of error corrected sequencing (ECS) that enable detection of very rare mutations well below the error rate of conventional next generation sequencing (NGS). Further, the methods disclosed herein enable multiplex targeting of genomic DNA.

BACKGROUND OF THE INVENTION

Massively parallel next generation sequencing is a powerful tool for whole genome sequencing. Its low cost relative to prior methods and ease of automation allow for large scale analyses of large genomes or many samples with an error rate of 1%. For many sequencing applications, this is sufficient; however, when searching for rare mutations in a heterogeneous population, this 1% error rate can confound the isolation of single base mutations in a small population of cells with technical sequencing errors. Detecting rare mutations at 2-5% variant allele fraction (VAF) using current methods requires costly and time-intensive deep resequencing, and lower-frequency variants are undetectable regardless of sequencing depth.

Several groups have tried to mitigate this problem through a variety of methods, including counting PCR amplicons (Casbon et al. 2011), large amounts of template and small numbers of cycles combined with statistical analyses (Flaherty et al. 2012), tagging of DNA molecules during initial PCR (Miner et al. 2004, Jabara et al. 2011, Smith et al. 2014 and Schmitt et al. 2012), and performing hybridization capture reactions in lieu of PCR (Hiatt et al. 2013). The methods of Casbon and Flaherty require complex mathematical models on the current data and are unsuitable for high throughput applications.

Previous implementations of error-corrected next-generation sequencing (NGS) have limitations that have hampered their clinical applicability. First, some methods cannot be targeted and are not compatible with multiplexing, which limits their ability to handle mammalian-sized genomes (Lou et al., 2013; Schmitt et al., 2012). The method of Schmitt also hinges on obtaining sequencing reads of both strands of the same molecule. In theory, this would mean about half of the sequencing power would be lost due to pairing up of data strands; however, due to experimental limitations, nearly three quarters of the sequencing reads are not included in the data analyses. While this is acceptable for some applications, Illumina® sequencing methods are expensive and this method of error correction requires wasting resources on data that are never going to be analyzed.

Several other targeted methods require large amounts of starting material. Schmitt's method as described uses 3 μg of DNA isolated from a phage library in Escherichia coli for library preparation. Jabara used 10,000 RNA molecules from a single HIV strain. Kinde and colleagues (2011) used a DNA library from 100,000 cells to isolate rare mutations using low-efficiency two-dimensional capture arrays. Such amounts of template are not available for sequencing genomic DNA samples of limited quantity.

SUMMARY OF THE INVENTION

In an aspect, the disclosure provides a method of identifying a genetic mutation in a biological sample comprising nucleic acid obtained from a subject. The method comprises: (a) amplifying one or more regions of interest from the biological sample comprising nucleic acid, wherein a plurality of amplicons for each region of interest are generated; (b) attaching an adapter and a random component to each amplicon generated in (a) and amplifying; (c) sequencing the amplicons comprising the random component generated in (b), wherein redundant reads are generated and wherein the redundant reads are grouped by the random component and a consensus sequence is identified; and (d) comparing the consensus sequence to a reference sequence, wherein a consensus sequence that differs from the reference sequence comprises a genetic mutation.

In another aspect, the disclosure provides a method of identifying a genetic mutation in a biological sample comprising nucleic acid obtained from a subject. The method comprises: (a) hybridizing a primer pool comprising one or more primer pairs specific to one or more regions of interest from the biological sample comprising nucleic acid, extending from an upstream primer of the primer pair to a downstream primer of the primer pair, and ligating the extension product to the downstream primer of the primer pair, wherein products comprising the regions of interest flanked by sequences required for amplification are generated; (b) attaching an adapter comprising a random component and attaching an adapter comprising an index sequence to the products from (a) and amplifying; (c) sequencing the products comprising the random component generated in (b), wherein redundant reads are generated and wherein the redundant reads are grouped by the random component and a consensus sequence is identified; and (d) comparing the consensus sequence to a reference sequence, wherein a consensus sequence that differs from the reference sequence comprises a genetic mutation.

In still another aspect, the disclosure provides a method of detecting minimal residual disease (MRD) in a subject. The method comprises: (a) hybridizing a primer pool comprising one or more primer pairs specific to one or more regions of interest from a biological sample comprising nucleic acid obtained from the subject, extending from an upstream primer of the primer pair to a downstream primer of the primer pair, and ligating the extension product to the downstream primer of the primer pair, wherein products comprising the regions of interest flanked by sequences required for amplification are generated; (b) attaching an adapter comprising a random component and attaching an adapter comprising an index sequence to the products from (a) and amplifying; (c) sequencing the products comprising the random component generated in (b), wherein redundant reads are generated and wherein the redundant reads are grouped by the random component and a consensus sequence is identified; and (d) comparing the consensus sequence to a reference sequence, wherein a consensus sequence that differs from the reference sequence comprises a genetic mutation and is indicative of MRD.

BRIEF DESCRIPTION OF THE FIGURES

The application file contains at least one drawing executed in color. Copies of this patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.

FIG. 1A, FIG. 1B, FIG. 1C, FIG. 1D, FIG. 1E, FIG. 1F, FIG. 1G and FIG. 1H depict graphs showing benchmarking for ECS and the identification of rare pre-leukemic mutations. (FIG. 1A, FIG. 1B) DNA extracted from a diagnostic leukemia sample with known mutations in RUNX1 (FIG. 1A) and IDH2 (FIG. 1B) was serially diluted into non-cancer, unrelated human DNA. Two replicates were run per sample/dilution. The coefficient of determination (r²) between diluted tumor concentration in the sample and VAF in the generated read families was 0.9999 and 0.9991 for RUNX1 and IDH2, respectively. (FIG. 1C) The VAFs at every nucleotide not expected to contain mutations in the dilution series experiment were analyzed to determine the error profile of the error-corrected consensus sequences compared with conventional deep sequencing. A cumulative distribution function of VAF demonstrated a reduced error profile in read families relative to conventional deep sequenced reads. (FIG. 1D) The most frequent class of substitution seen in read families was G to T (C to A) transversions, which was consistent with oxidative conversion of guanine to 8-oxo-guanine. (FIG. 1E, FIG. 1F) The leukemia-specific variants identified in ASXL1 and U2AF1 at diagnosis (circled) were not distinguishable from sequencing errors in the same substitution class by conventional deep sequencing. (FIG. 1G, FIG. 1H) Targeted error-corrected sequencing identified the ASXL1 variant in the 2002 banked sample at 0.004 VAF and the U2AF1 variant in the 2004 banked sample at 0.009 VAF.

FIG. 2A, FIG. 2B, FIG. 2C, FIG. 2D, FIG. 2E and FIG. 2F depict schematics of the error-corrected sequencing workflow. Schematic depiction of library preparation (FIG. 2A, FIG. 2B, FIG. 2C) and bioinformatics analysis (FIG. 2D, FIG. 2E, FIG. 2F) for generating read families and error-corrected consensus sequences. First, the region of interest is amplified from genomic DNA (FIG. 2A), then the sequencing library is prepared (FIG. 2B) generating a sequence library (FIG. 2C). From the sequence library, read families are generated (FIG. 2D) and an error-corrected consensus sequence (ECCS) is created (FIG. 2E). The ECCSs are aligned to identify a variant allele (FIG. 2F).

FIG. 3 depicts a graph showing the cumulative distribution function of the error profile comparing ECS to conventional deep sequencing. The variant allele fraction for each non-variant position covered in the dilution series experiment was sorted and plotted cumulatively. The variant allele fractions of errors were higher in every nucleotide covered across all substitution types for the raw sequenced reads compared with the error-corrected consensus sequences generated from read families.

FIG. 4 depicts a graph showing the cumulative distribution function of read family error profile per specific substitution type with and without FPG pretreatment. The error profile of G to T (C to A) substitutions, consistent with guanine oxidation to 8-oxo-guanine, was higher than the other classes of mutations. The C to T (G to A) substitutions, consistent with cytosine deamination to uracil, were visible just over the error profile for the remaining 8 types of substitutions (inset). FPG pretreatment did not appreciably change the error profile.

FIG. 5A, FIG. 5B, FIG. 5C, FIG. 5D, FIG. 5E and FIG. 5F depict graphs showing ASXL1 mutations over time in UPN684949. Formalin-fixed paraffin-embedded bone marrow samples were banked over three years (2002, 2003, 2004) from this individual. Conventional deep sequencing (FIG. 5A, FIG. 5B, FIG. 5C) only distinguished the ASXL1 variant from the T to G sequencing errors in the 2003 banked sample at 0.097 VAF (FIG. 5B). FIG. 5A is samples from 2002 and FIG. 5C is samples from 2004. Correcting the sequencing errors with ECS clearly identified the ASXL1 variant at 0.0042 VAF in 2002 (FIG. 5D), 0.092 VAF in 2003 (FIG. 5E) and 0.029 VAF in 2004 (FIG. 5F).

FIG. 6A, FIG. 6B, FIG. 6C, FIG. 6D, FIG. 6E and FIG. 6F depict graphs showing U2AF1 mutations over time in UPN684949. Formalin-fixed paraffin-embedded bone marrow samples were banked over three years (2002, 2003, 2004) from this individual. Conventional deep sequencing (FIG. 6A, FIG. 6B, FIG. 6C) only distinguished the U2AF1 variant from the G to T sequencing errors in the 2003 banked sample at 0.036 VAF (FIG. 6B). FIG. 6A is samples from 2002 and FIG. 6C is samples from 2004. Correcting the sequencing errors with ECS did not identify the U2AF1 variant in 2002 (FIG. 6D), but did identify the U2AF1 variant at 0.031 VAF in 2003 (FIG. 6E) and 0.0089 VAF in 2004 (FIG. 6F).

FIG. 7 depicts a graph showing the error profile observed with increased read family size. Read families generated with 3× or greater coverage (solid line) had a higher cumulative distribution of erroneous substitutions called compared to read families with 5× or greater coverage (dotted line).

FIG. 8 depicts a graph showing the representative distribution of read family size. Singletons represent index sequences containing a sequencing error. Excluding singletons, the median read family size was 7× (mean 7.4×). Only read families with 5-20 reads were included in ECS analysis.

FIG. 9A and FIG. 9B depict a method of multiplex targeted genomic capture using the error-corrected sequencing methodology. FIG. 9A depicts (a) the annealing of primers to genomic DNA, (b) single strand extension, and (c) ligation. FIG. 9B depicts (d) the newly minted single-stranded amplicon after capture, (e) attachment of an adapter with a sample-specific index (fixed) via PCR, (f) attachment of an adapter with an ECS index (random) via PCR, and (g) amplifying of this molecule to make read families.

FIG. 10A and FIG. 10B depict graphs showing that the amplicon coverage between replicates is correlated. FIG. 10A shows two libraries sequenced on the same run with an R² of 0.9718 and FIG. 10B shows two libraries sequenced on different runs with an R² of 0.7536.

FIG. 11 depicts a graph showing that the coverage per amplicon is variable.

FIG. 12 depicts graphs showing the identification of constitutional and, importantly, rare SNVs in different samples. 49 germline (0.5/1.0 VAF) SNVs were identified, 5 high VAF (0.14-0.36 VAF) SNVs were identified, and 106 low VAF (<0.1 VAF) SNVs were identified for a total of 160 SNVs identified.

FIG. 13 depicts graphs showing that rare subclones are detected longitudinally in the same healthy individual.

FIG. 14 depicts a graph showing the total rare subclonal variants detected per individual. The majority of the subclonal variants were detected in exonic regions.

FIG. 15A and FIG. 15B depict pie charts showing the classification of detected rare subclonal variants. FIG. 15A shows the detected rare variants by function, of which the majority are exonic. FIG. 15B shows the detected exonic rare variants by function, of which the majority are nonsynonymous SNVs.

FIG. 16 depicts a graph showing that the detected exonic variants cluster in DNMT3A and TET2.

FIG. 17 depicts a graph showing that intronic variants are more evenly distributed.

FIG. 18A and FIG. 18B depict graphs showing that the variants are not exclusively called in highly covered amplicons. FIG. 18A shows the histogram of coverage per amplicon. FIG. 18B shows the histogram of coverage per amplicon with variants called.

FIG. 19A and FIG. 19B depict graphs showing that the target space per gene does not correlate with SNV calls per gene. FIG. 19A shows the exonic mutations per target space in the panel. FIG. 19B shows the intronic mutations per target space in the panel.

FIG. 20 depicts the distribution of exonic mutations by gene.

FIG. 21 depicts the spectrum of DNMT3A mutations.

FIG. 22 depicts the spectrum of TET2 mutations.

FIG. 23 depicts the distribution of rare subclonal mutations per person.

FIG. 24 depicts the COSMIC variants analyzed to validate the ECS methodology disclosed herein with ddPCR.

FIG. 25A and FIG. 25B depict the concordance of VAF measured by ECS and ddPCR. FIG. 25A shows that VAF measured by ECS is highly correlated with VAF measured by ddPCR (R²=0.98). FIG. 25B shows that even when specifically focusing on a VAF of <0.01, ECS and ddPCR still correlated (R²=0.72).

FIG. 26 depicts the number of singleton ddPCR droplets generated by flow sorting cells from the study participants and extracting genomic DNA from those flow sorted cells.

FIG. 27 depicts graphs showing that subclonal mutations are present in multiple lineages in all tested samples.

FIG. 28 depicts graphs showing an expanded view of FIG. 27 focusing on a VAF of <0.01.

DETAILED DESCRIPTION OF THE INVENTION

The present inventors have developed sequencing methods for identification of rare mutations. Methods of the present disclosure can be used to quantify rare somatic mutations in, for example, DNA from clinical specimens. Importantly, the limit of detection for the disclosed method is at least two orders of magnitude below the error rate of the Illumina® sequencing platform performed by standard methods.

In various embodiments, methods of the present disclosure involve PCR amplification of multiple regions of interest in the genome, attaching adaptors comprising a random component and/or index sequences to the amplified DNA, performing sequencing, creating read families of the same index sequence and comparing reads in the same family. By these methods, true variations in the sequence can be distinguished from technical artifacts.
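By way of non-limiting illustration only, the grouping of redundant reads into read families and the comparison of each family consensus with a reference might be sketched as follows. This is a minimal sketch under stated assumptions and is not the inventors' analysis pipeline: the function names, the in-memory (random component, sequence) input format, and the 0.9 agreement threshold are hypothetical. The minimum family size of 5 echoes the read family filter described for FIG. 8, and the upper bound of 20 reads is omitted for brevity.

```python
from collections import Counter, defaultdict

MIN_FAMILY_SIZE = 5  # lower bound echoing FIG. 8; upper bound omitted here
MAJORITY = 0.9       # assumed fraction of reads that must agree at a position


def build_read_families(reads):
    """Group (random_component, sequence) pairs by their random component."""
    families = defaultdict(list)
    for random_component, sequence in reads:
        families[random_component].append(sequence)
    return families


def consensus(sequences, majority=MAJORITY):
    """Per-position consensus; 'N' marks positions lacking agreement."""
    length = min(len(s) for s in sequences)
    called = []
    for i in range(length):
        base, count = Counter(s[i] for s in sequences).most_common(1)[0]
        called.append(base if count / len(sequences) >= majority else "N")
    return "".join(called)


def call_variants(reads, reference, min_family_size=MIN_FAMILY_SIZE):
    """Return (random_component, position, ref_base, alt_base) mismatches."""
    variants = []
    for random_component, sequences in build_read_families(reads).items():
        if len(sequences) < min_family_size:
            continue  # too few redundant reads to correct errors
        cons = consensus(sequences)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, cons)):
            if alt_base not in ("N", ref_base):
                variants.append((random_component, pos, ref_base, alt_base))
    return variants


# Hypothetical toy data: one family with an isolated read error, one family
# carrying a consistent substitution, and one family that is too small.
reference = "ACGTACGT"
reads = (
    [("AAAA", "ACGTACGT")] * 4 + [("AAAA", "ACGAACGT")]
    + [("CCCC", "ACGTTCGT")] * 5
    + [("GGGG", "ACGTACGA")] * 2
)
print(call_variants(reads, reference))  # [('CCCC', 4, 'A', 'T')]
```

In this toy input, the substitution seen in only one of five reads of the first family is masked rather than called, whereas the substitution shared by every read of the second family is reported, which is the distinction between a technical artifact and a true variation drawn above.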

The methodology disclosed herein is described in greater detail below.

(a) Sample Preparation

The disclosure encompasses a method of identifying a rare sequence in a sample comprising nucleic acid. In an embodiment, the disclosure encompasses a method of identifying a genetic mutation in a sample comprising nucleic acid. Specifically, the disclosure encompasses a method of identifying a genetic mutation in a biological sample comprising nucleic acid obtained from a subject. A first iteration of the method may be used to query about 1 to about 20 genomic loci (e.g., regions of interest). A second iteration of the method may be used to query about 1 to about 600 or more genomic loci.

A region of interest may be any nucleic acid amenable to standard PCR. Non-limiting examples of a region of interest may be a nucleic acid used to identify a rare mutation or low levels associated with drug-resistance, graft rejection, residual disease, tumors, immune diseases, fetal DNA, and microbial infection or contamination. With respect to microbial infection or contamination, a region of interest may be a nucleic acid used to identify a bacterial strain. It is known in the art that 16S nucleic acid is a good, widely used nucleic acid to identify a bacterial strain. In an embodiment, the primer pair comprises sequences complementary to a 16S nucleic acid sequence. In another embodiment, the region of interest may be one or more nucleic acids used to diagnose cancer, wherein a mutation within that region of interest is indicative of cancer. Specifically, the region of interest may be one or more nucleic acids used to diagnose leukemia. For example, the region of interest may be any nucleic acid known to be mutated in leukemia.

The sample comprising nucleic acid may be a sample from a subject, the environment, a laboratory, or any sample in which nucleic acid is present. When the sample is from a subject, the sample may be from stool, sputum, urine, plasma, peripheral blood, serum, bone marrow, tissue, and other bodily fluids. The tissue sample may be a tissue biopsy. The biopsied tissue may be fixed, embedded in paraffin or plastic, and sectioned, or the biopsied tissue may be frozen and cryosectioned. Alternatively, the biopsied tissue may be processed into individual cells or an explant, or processed into a homogenate, a cell extract, a membranous fraction, or a protein extract. The sample may be used “as is” or the nucleic acid may be purified from the sample prior to sample preparation.

The subject may be a rodent, a human, a livestock animal, a companion animal, or a zoological animal. In one embodiment, the subject may be a rodent, e.g. a mouse, a rat, a guinea pig, etc. In another embodiment, the subject may be a livestock animal. Non-limiting examples of suitable livestock animals may include pigs, cows, horses, goats, sheep, llamas and alpacas. In still another embodiment, the subject may be a companion animal. Non-limiting examples of companion animals may include pets such as dogs, cats, rabbits, and birds. In yet another embodiment, the subject may be a zoological animal. As used herein, a “zoological animal” refers to an animal that may be found in a zoo. Such animals may include non-human primates, large cats, wolves, and bears. In a preferred embodiment, the subject is a human.

i. First Iteration

In an aspect, a method of the disclosure comprises, in part, amplifying one or more regions of interest from a biological sample comprising nucleic acid. The amplification generates a plurality of amplicons for each region of interest.

Amplification takes place in the presence of one or more primer pairs. A first primer of the primer pair comprises a sequence complementary to an upstream portion of the region of interest and a second primer of the primer pair comprises a sequence complementary to a downstream portion of the region of interest. The primer pairs are designed to anneal to complementary strands of nucleic acid (i.e. one primer of the primer pair anneals to the sense strand and one primer of the primer pair anneals to the antisense strand). The complementary sequence may be altered based on the region of interest to be amplified. The complementary sequences of the primer pair may comprise about 10 to about 100 nucleotides complementary to the region of interest. For example, the complementary sequences of the primer pair may comprise about 15 to about 50 nucleotides complementary to the region of interest. In an embodiment, the complementary sequences of the primer pair may comprise about 20 to about 40 nucleotides complementary to the region of interest. In another embodiment, the complementary sequences of the primer pair may comprise about 20 to about 35 nucleotides complementary to the region of interest.

One or more primer pairs are contacted with a sample comprising nucleic acid. Nucleic acid may be, for example, RNA or DNA. Modified forms of RNA or DNA may be used. In an exemplary embodiment, the nucleic acid is genomic DNA. The amount of nucleic acid in the sample may be about 200 to about 1000 ng or more. For example, the amount of nucleic acid in the sample may be about 400 to about 800 ng. In certain embodiments, the amount of nucleic acid in the sample is about 200 ng, about 300 ng, about 400 ng, about 500 ng, about 600 ng, about 700 ng, about 800 ng, about 900 ng or about 1000 ng or more. In some embodiments, the amount of nucleic acid in the sample may be about 1 μg, about 5 μg, about 10 μg, about 20 μg, about 30 μg, about 40 μg, or about 50 μg. It is important to note that as the amount of nucleic acid increases, the amount of random components (described below) must proportionally increase to ensure that the same random component is not utilized twice. A person of skill in the art would understand how to scale the methodology based on the amount of nucleic acid used.
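To illustrate why the number of random components must scale with input mass, the following rough sketch converts a mass of human genomic DNA into an approximate number of haploid genome copies, each of which contributes one template per region of interest. The figure of about 3.3 pg per haploid human genome is a commonly cited approximation that is assumed here and is not taken from this disclosure; the function name is hypothetical.

```python
HAPLOID_GENOME_PG = 3.3  # assumed approximate mass of one haploid human genome (pg)


def haploid_genome_copies(mass_ng, genome_pg=HAPLOID_GENOME_PG):
    """Approximate haploid genome copies contained in mass_ng of genomic DNA."""
    return mass_ng * 1000.0 / genome_pg  # 1 ng = 1000 pg


# Template copies grow linearly with input mass, so the pool of random
# components must grow at least as fast to keep reuse improbable.
for mass_ng in (200, 500, 1000):
    print(f"{mass_ng} ng -> ~{haploid_genome_copies(mass_ng):,.0f} copies per locus")
```

Under this assumption, 200 ng corresponds to roughly 6×10⁴ template copies per region of interest, and a five-fold increase in mass yields a five-fold increase in templates.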

In general, amplification of the region of interest is carried out using polymerase chain reaction (PCR). A PCR reaction may comprise sample comprising nucleic acid, one or more primer pairs, polymerase, water, buffer, and deoxynucleotide triphosphates (dNTPs) in a single reaction vial. PCR may be performed according to standard methods in the art. By way of non-limiting example, the PCR reaction may comprise denaturation, followed by about 15 to about 30 cycles of denaturation, annealing and extension, followed by a final extension. In an exemplary embodiment, the PCR reaction comprises denaturation at about 98° C. for about 30 seconds, followed by about 15 to about 30 cycles of (about 98° C. for about 10 seconds, about 62-72° C. for about 30 seconds, about 72° C. for about 30 seconds), followed by a final extension at about 72° C. for about 2 minutes.

In certain embodiments, a single reaction vial is used per primer pair. In other embodiments, a single reaction vial comprises more than one primer pair such that more than one region of interest is amplified per reaction vial. More specifically, amplification of a region of interest may be multiplexed. Accordingly, a single reaction comprises primer pairs sufficient to amplify about 1-5, about 5-10, about 10-15, or about 15-20 regions of interest. In other embodiments, a single reaction comprises primer pairs sufficient to amplify 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 regions of interest.

In a different embodiment, a single reaction vial may comprise more than one primer pair and the amplification may be carried out for about 10-20 cycles, or about 15-20 cycles. Then, the amplicons may be separated into more than one reaction vial, one primer pair is then added to each reaction vial, and the reaction may be carried out for an additional about 10-20 cycles, or about 15-20 cycles.

Optionally, the amplicons may be purified prior to attaching an adapter, random component, and/or index sequence (described below in Section I(b)). Methods of purifying amplicons are known in the art. For example, AMPure bead cleanup may be used.

ii. Second Iteration

In an aspect, a method of the disclosure comprises, in part, hybridizing a primer pool comprising one or more primer pairs specific to one or more regions of interest from the biological sample comprising nucleic acid, extending from an upstream primer of the primer pair to a downstream primer of the primer pair, and ligating the extension product to the downstream primer of the primer pair. The hybridization, extension and ligation generate products comprising the regions of interest flanked by sequences required for amplification.

The primer pool comprises one or more primer pairs designed to anneal to the same strand of nucleic acid. A first primer of the primer pair comprises a sequence complementary to an upstream portion of the region of interest and a second primer of the primer pair comprises a sequence complementary to a downstream portion of the region of interest. The complementary sequence may be altered based on the region of interest to be amplified. The complementary sequences of the primer pair may comprise about 10 to about 100 nucleotides complementary to the region of interest. For example, the complementary sequences of the primer pair may comprise about 15 to about 50 nucleotides complementary to the region of interest. In an embodiment, the complementary sequences of the primer pair may comprise about 20 to about 40 nucleotides complementary to the region of interest. In another embodiment, the complementary sequences of the primer pair may comprise about 20 to about 35 nucleotides complementary to the region of interest. In an exemplary embodiment, the primer pool is the TruSight® Myeloid Sequencing Panel (Illumina).

The primer pool is contacted with a sample comprising nucleic acid. Nucleic acid may be, for example, RNA or DNA. Modified forms of RNA or DNA may be used. In an exemplary embodiment, the nucleic acid is genomic DNA. The amount of nucleic acid in the sample may be about 200 to about 1000 ng or more. For example, the amount of nucleic acid in the sample may be about 400 to about 800 ng. In certain embodiments, the amount of nucleic acid in the sample is about 200 ng, about 300 ng, about 400 ng, about 500 ng, about 600 ng, about 700 ng, about 800 ng, about 900 ng or about 1000 ng or more. In some embodiments, the amount of nucleic acid in the sample may be about 1 μg, about 5 μg, about 10 μg, about 20 μg, about 30 μg, about 40 μg, or about 50 μg. It is important to note that as the amount of nucleic acid increases, the amount of random components (described below) must proportionally increase to ensure that the same random component is not utilized twice. A person of skill in the art would understand how to scale the methodology based on the amount of nucleic acid used.

Hybridization of the primer pool to the nucleic acid may be done via methods standard in the art. For example, the primer pool and nucleic acid may be incubated at elevated temperature for about 1 to about 2 hours. More specifically, the primer pool and nucleic acid may be incubated at about 95° C. for about 1 minute and then the temperature may be allowed to decrease to about 40° C. for about 80 minutes.

Following hybridization, a polymerase extends from the upstream primer through the region of interest, followed by ligation to the 5′ end of the downstream primer using ligase. This process results in the formation of products comprising the regions of interest flanked by sequences required for amplification. If the nucleic acid is DNA, the polymerase may be any suitable DNA polymerase known in the art. Further, if the nucleic acid is DNA, the ligase may be any suitable DNA ligase known in the art. Extension and ligation may be carried out via methods standard in the art and dependent upon the polymerase and ligase utilized. For example, extension and ligation may be conducted at about 37° C. for about 45 minutes.

Optionally, following hybridization, the unbound primers may be washed away prior to proceeding to the extension and ligation step. Methods of washing away unbound primers are known in the art.

A single reaction vial comprises the entire primer pool such that more than one region of interest per reaction vial may be amplified downstream in the method. More specifically, the method enables multiplex targeting from genomic DNA. Accordingly, a single reaction comprises primer pairs sufficient to hybridize to about 1-5, about 5-10, about 10-20, about 20-30, about 30-40, about 40-50, about 50-60, about 60-70, about 70-80, about 80-90, about 90-100, about 100-150, about 150-200, about 200-250, about 250-300, about 300-350, about 350-400, about 400-450, about 450-500, about 500-550, about 550-600, about 600-700, about 700-800, about 800-900, or about 900-1000 regions of interest. Alternatively, a single reaction comprises primer pairs sufficient to hybridize to more than 100, more than 150, more than 200, more than 250, more than 300, more than 350, more than 400, more than 450, more than 500, more than 550, more than 600, more than 650, more than 700, more than 750, more than 800, more than 850, more than 900, more than 950, more than 1000, more than 1050, more than 1100, more than 1150, more than 1200, more than 1250, more than 1300, more than 1350, more than 1400, more than 1450, more than 1500, more than 1550, more than 1600, more than 1650, more than 1700, more than 1750, more than 1800, more than 1850, more than 1900, more than 1950, or more than 2000 regions of interest. In certain embodiments, a single reaction comprises primer pairs sufficient to hybridize to about 100, about 150, about 200, about 250, about 300, about 350, about 400, about 450, about 500, about 550, about 600, about 650, about 700, about 750, about 800, about 850, about 900, about 950, about 1000, about 1050, about 1100, about 1150, about 1200, about 1250, about 1300, about 1350, about 1400, about 1450, about 1500, about 1550, about 1600, about 1650, about 1700, about 1750, about 1800, about 1850, about 1900, about 1950, or about 2000 regions of interest. In other embodiments, a single reaction comprises primer pairs sufficient to hybridize to about 5, about 10, about 15, about 20, about 25, about 30, about 35, about 40, about 45, about 50, about 55, about 60, about 65, about 70, about 75, about 80, about 85, about 90, about 95, or about 100 regions of interest. In still other embodiments, a single reaction comprises primer pairs sufficient to hybridize to about 500, about 510, about 515, about 520, about 525, about 530, about 535, about 540, about 545, about 550, about 555, about 560, about 565, about 570, about 575, about 580, about 585, about 590, about 595, or about 600 regions of interest.

(b) Error-Corrected Sequencing Library Preparation

A method of the disclosure comprises, in part, attaching an adapter, random component, and/or index sequence to each amplicon or product generated in Section I(a).

As used herein, an “adapter” is a sequence that permits universal amplification. A key feature of the adapter is to enable the unique amplification of the amplicon or product only without the need to remove existing template nucleic acid or purify the amplicons or products. This feature enables an “add only” reaction with fewer steps and ease of automation. The adapter is attached to the 5′ and 3′ end of the amplicon or product. The adapter may be Y-shaped, U-shaped, hairpin-shaped, or a combination thereof. In a specific embodiment, the adaptor is Y-shaped. In an exemplary embodiment, the adapter may be an Illumina adapter for Illumina sequencing.

As used herein, a “random component” is composed of random nucleotides to generate a complexity of random components far greater than the number of unique amplicons or products to be sequenced. This ensures that having the same random component attached to multiple amplicons or products is an extremely statistically improbable event. A random component may also be referred to as a barcode. The random component design can theoretically generate 9.1×10⁸ to 1.4×10¹⁰ unique random components. This complexity can easily be expanded by increasing the length of the random regions in the random component. In an embodiment, the random component may be about 5 to about 100 nucleotides. In another embodiment, the random component may be about 10 to about 25 nucleotides. For example, the random component may be about 15 to about 20 nucleotides. In still another embodiment, the random component is about 16 to about 18 nucleotides. Accordingly, the random component may be 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 or 25 or more nucleotides. The random component is attached to the 5′ or 3′ end of the amplicon or product. In a specific embodiment, the random component is attached to the 5′ end of the amplicon or product.

In addition to a random component and an adapter, an index sequence may also be attached to each amplicon or product generated. The addition of an index sequence allows pooling of multiple samples into a single sequencing run. This greatly increases experimental scalability, while maintaining extremely low error rates and conserving read length. The index sequence may be about 5 to about 10 nucleotides. Accordingly, the index sequence may be 5, 6, 7, 8, 9 or 10 or more nucleotides. In an embodiment, the index sequence is about 6 nucleotides.

In a specific embodiment, an adapter, a random component and an index sequence are attached to each amplicon or product. In an embodiment, a nucleotide sequence comprising an adapter and a random component is attached to the 5′ end of each amplicon or product and a nucleotide sequence comprising an adapter and an index sequence is attached to the 3′ end. In another embodiment, a nucleotide sequence comprising an adapter and a random component is attached to the 3′ end of each amplicon or product and a nucleotide sequence comprising an adapter and an index sequence is attached to the 5′ end. In still another embodiment, a nucleotide sequence comprising an adapter, a random component and an index sequence is attached to the 5′ end and a nucleotide sequence comprising an adapter is attached to the 3′ end. In still yet another embodiment, a nucleotide sequence comprising an adapter, a random component and an index sequence is attached to the 3′ end and a nucleotide sequence comprising an adapter is attached to the 5′ end. In an exemplary embodiment, a nucleotide sequence comprising SEQ ID NO:1 is attached to each amplicon or product at the 5′ end and a nucleotide sequence comprising SEQ ID NO:2 is attached to each amplicon or product at the 3′ end. In another exemplary embodiment, a nucleotide sequence comprising SEQ ID NO:2 is attached to each amplicon or product at the 5′ end and a nucleotide sequence comprising SEQ ID NO:1 is attached to each amplicon or product at the 3′ end.

The nucleotide sequence comprising an adapter, a random component and/or an index sequence may be attached to the amplicon or product via methods known in the art. In certain embodiments, the nucleotide sequence comprising an adapter, a random component and/or an index sequence is ligated to an amplicon or product via methods standard in the art. For example, the nucleotide sequence is annealed at about 95° C. for about 5 minutes, then the temperature is decreased by about 1° C. about every 30 seconds until about 4° C. Enrichment of the properly ligated products is then carried out. Methods of enriching properly ligated products are known in the art. For example, PCR amplification is carried out using the ligation product and appropriate primers. In an exemplary embodiment, the PCR is carried out as follows: about 98° C. for about 30 seconds, followed by about 6 cycles of about 98° C. for about 10 seconds, about 57° C. for about 30 seconds, and about 72° C. for about 30 seconds, finishing with an extension at about 72° C. for about 2 minutes.

In other embodiments, the nucleotide sequence comprising an adapter, a random component and/or an index sequence is attached to an amplicon or product via PCR. For example, the amplicon or product may be contacted with a nucleotide sequence comprising an adapter and an index sequence and a PCR reaction is conducted. Then, this product is contacted with a nucleotide sequence comprising an adapter and a random component and a PCR reaction is conducted. The resulting product is a nucleotide sequence comprising an adapter, a random component, a region of interest, an index sequence and a downstream adapter. Alternatively, the amplicon or product may be contacted with a nucleotide sequence comprising an adapter and a random component and a PCR reaction is conducted. Then, this product is contacted with a nucleotide sequence comprising an adapter and an index sequence and a PCR reaction is conducted. The resulting product is a nucleotide sequence comprising an adapter, an index sequence, a region of interest, a random component and a downstream adapter.

The products or amplicons comprising an adapter, a random component and/or an index sequence are then subjected to exponential PCR. In an embodiment, an exponential PCR reaction may comprise the products or amplicons comprising an adapter, a random component and/or an index sequence, primers, polymerase, water, buffer, and deoxynucleotide triphosphates (dNTPs) in a single reaction vial. Exponential PCR may be performed according to standard methods in the art. By way of non-limiting example, the exponential PCR reaction may comprise denaturation, followed by about 15-30 cycles of denaturation, annealing and extension, followed by a final extension. In a specific embodiment, the exponential PCR reaction comprises denaturation at about 95° C. for about 3 minutes, followed by about 16-33 cycles of (about 95° C. for about 30 seconds, about 62-72° C. for about 30 seconds, about 72° C. for about 60 seconds), followed by a final extension at about 72° C. for about 5 minutes.

Upon performing exponential PCR, the products or amplicons comprising an adapter, a random component and/or an index sequence are amplified. The exponential PCR products comprise: an adapter, a random component, a region of interest, a downstream adapter and an index sequence.

Optionally, the products or amplicons comprising an adapter, a random component and/or an index sequence may be purified prior to exponential PCR. Methods of purifying products or amplicons are known in the art. For example, AMPure bead cleanup may be used.

(c) Error-Corrected Sequencing

A method of the disclosure comprises, in part, sequencing the exponential PCR product. According to the method of the disclosure, sequencing of the exponential PCR product generates redundant reads. The redundant reads are grouped by random component and a consensus sequence is identified such that the redundant reads mitigate sequence errors.

Sequencing may be performed according to standard methods in the art. Sequencing is preferably performed on a massively parallel sequencing platform, many of which are commercially available including, but not limited to, Illumina, Roche/454, Ion Torrent, and PacBio. In an exemplary embodiment, Illumina sequencing is used.

Reads may be separated by the index sequence and trimmed to remove primer sequences. Reads may be grouped by the random component. In certain embodiments, groups of reads with fewer than three, fewer than four, or fewer than five reads may be removed. To eliminate ambiguous sequences, the random components may be sorted by abundance and clustered at an identity of about 85%. Alternatively, the random components may be sorted by abundance and clustered at an identity of about 65% to about 95%. The random components may be clustered from most abundant to least abundant. Given that most sequencing errors are random and that the correct sequence should occur more often than a variant with sequencing errors, the abundance-weighted clustering provides a means to eliminate spurious random components that are most likely due to sequencing errors while retaining the more abundant (and most likely true positive) random components.
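The grouping and size-filtering steps described above can be sketched in Python as follows. This is a minimal illustration only: the function name, the `(random_component, sequence)` input format and the default cutoff of three reads are assumptions made for the example, and the abundance-sorted clustering of random components at about 85% identity is not implemented here.

```python
from collections import defaultdict


def group_reads_by_random_component(reads, min_family_size=3):
    """Group reads by random component and drop small groups.

    `reads` is an iterable of (random_component, sequence) tuples; this
    input format is a hypothetical simplification of demultiplexed,
    primer-trimmed reads.
    """
    families = defaultdict(list)
    for random_component, sequence in reads:
        families[random_component].append(sequence)
    # Keep only groups with at least `min_family_size` reads.
    return {
        component: sequences
        for component, sequences in families.items()
        if len(sequences) >= min_family_size
    }


example_reads = [
    ("AAAACCCCGGGGTTTT", "ACGT"),
    ("AAAACCCCGGGGTTTT", "ACGT"),
    ("AAAACCCCGGGGTTTT", "ACGA"),
    ("TTTTGGGGCCCCAAAA", "ACGT"),
]
kept = group_reads_by_random_component(example_reads, min_family_size=3)
```

In this toy input, only the first random component has three reads, so the second group is removed before consensus calling.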

This redundant sequencing of each amplicon or product allows the error-correction of each amplicon or product. For example, a consensus sequence is generated for each random component group by scoring and weighing the nucleotide at each base position. Sequences with a consensus sequence that is identical to the most abundant sequence associated with the same random component are kept; this process is called quality filtering. Specifically, at every position, the nucleotides called by each sequence read are compared and a consensus nucleotide is called if there is at least about 90% agreement between the reads. If there is less than about 90% agreement, an “N” is called in the consensus sequence at that position.
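The per-position consensus rule can be illustrated with a short Python sketch. It assumes, for simplicity, that all reads in a random component group have already been trimmed to the same length and orientation; the function name and the `min_agreement` parameter are hypothetical, and the quality-filtering comparison against the most abundant sequence is not shown.

```python
from collections import Counter


def consensus_sequence(reads, min_agreement=0.9):
    """Call a consensus base per position, or "N" below the agreement cutoff.

    Assumes equal-length, pre-aligned reads from one random component group.
    """
    if not reads:
        raise ValueError("at least one read is required")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("reads must all have the same length")
    consensus = []
    for bases in zip(*reads):  # iterate over columns (positions)
        base, count = Counter(bases).most_common(1)[0]
        agreement = count / len(bases)
        consensus.append(base if agreement >= min_agreement else "N")
    return "".join(consensus)


ten_reads = ["ACGT"] * 9 + ["ACGA"]   # 9 of 10 reads agree at the last position
five_reads = ["ACGT"] * 4 + ["ACGA"]  # 4 of 5 reads agree at the last position
high_agreement = consensus_sequence(ten_reads)
low_agreement = consensus_sequence(five_reads)
```

With ten reads, a single discordant base still leaves 90% agreement and the consensus base is called; with five reads, one discordant base drops agreement to 80% and that position becomes "N".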

The inventors demonstrated that the methodology disclosed herein was 99% specific to detect variants above 0.0034 VAF for G to T (C to A) substitutions, 0.00020 VAF for C to T (G to A) substitutions, and 0.000079 VAF for the other eight possible substitutions.

(d) Comparison to Reference Sequence

After an error-corrected consensus sequence (ECCS) has been identified, the ECCS may be compared to a reference sequence to determine the presence of one or more mutations. A reference sequence may be a sequence without any known mutations. A reference sequence without any known mutations may be referred to as a wild-type sequence. In certain embodiments, a reference sequence is a human sequence.

Comparison of the ECCSs to a reference sequence may identify clinically silent single nucleotide variations (SNVs). Specifically, the method disclosed herein may identify a genetic mutation that is present at a frequency of less than 1 in 1,000 in the sample (0.1%). For example, the method disclosed herein may identify a genetic mutation that is present at a frequency of less than 1 in 1,000, less than 1 in 2,000, less than 1 in 3,000, less than 1 in 4,000, less than 1 in 5,000, less than 1 in 6,000, less than 1 in 7,000, less than 1 in 8,000, less than 1 in 9,000, or less than 1 in 10,000 in the sample. In a specific embodiment, the method disclosed herein may identify a genetic mutation that is present at a frequency of less than 1 in 10,000 in the sample (0.01%).

II. Methods of Use

A method of the invention may be used to quantitate as well as to determine a sequence. For example, the relative abundance of two or more analyte nucleic acid fragments may be compared. A method of the invention may be used to identify rare mutants in a population of DNA templates, to measure polymerase error rates, or to judge the reliability of oligonucleotide synthesis. Additionally, a method of the invention may be used to diagnose, treat or prevent a disease in a subject. Identification of a rare mutation could facilitate the diagnosis of a disease, enable the proper methodology, such as a therapeutic, to treat the disease, or prevent the onset of disease by administration of prophylactic therapies. Still further, a method of the invention may be used to detect genetic mutations involved in cancer or other diseases, such as immune-mediated diseases. In another embodiment, a method of the invention may be used to identify and quantify a microbial infection of a subject. The knowledge gained may be used to assess the health of the subject. Further, a method of the invention may be used as a quality control measurement in clinical labs or in synthetic biology to determine microbial contamination.

The examples below describe a method of identifying ultra-rare pre-leukemic clones using the methodology described above. The methodology disclosed herein substantially improves the accuracy and depth of massively parallel sequencing. Thus, the methodology results in an assay to determine a VAF of 1:10,000 molecules in individuals at high depth with high precision. The methodology disclosed herein may be applied to virtually any sample preparation workflow or sequencing platform. As demonstrated here, the approach can easily be used to identify rare or low-abundance mutations indicative of disease, such as leukemia.

EXAMPLES

The following examples are included to demonstrate preferred embodiments of the invention. It should be appreciated by those of skill in the art that the techniques disclosed in the examples that follow represent techniques discovered by the inventors to function well in the practice of the invention, and thus can be considered to constitute preferred modes for its practice. However, those of skill in the art should, in light of the present disclosure, appreciate that many changes can be made in the specific embodiments which are disclosed and still obtain a like or similar result without departing from the spirit and scope of the invention.

Example 1. Quantifying Ultra-Rare Pre-Leukemic Clones Via Targeted Error-Corrected Sequencing

The quantification of rare clonal and subclonal populations from a heterogeneous DNA sample has multiple clinical and research applications for the study and treatment of leukemia. Specifically, in the hematopoietic compartment, recent reports demonstrate the presence of subclonal variation in normal and malignant hematopoiesis,^(1,2) and leukemia is now recognized as an oligoclonal disease.³ Currently, clonal heterogeneity in leukemia is studied using next-generation sequencing (NGS) targeting subclone-specific mutations. With this method, detecting mutations at 2-5% variant allele fraction (VAF) requires costly and time-intensive deep resequencing, and identifying lower frequency variants is impractical regardless of sequencing depth. Recently, various methods have been developed to circumvent the error rate of NGS.^(4,6) These methods tag individual DNA molecules with unique oligonucleotide indexes, which enable error correction after sequencing.

Here we present a direct application of error-corrected sequencing (ECS) to study clonal heterogeneity during leukemogenesis and validate the accuracy of this method with a series of benchmarking experiments. Specifically, we demonstrate the ability of ECS to identify leukemia-associated mutations in banked pre-leukemic blood and bone marrow from patients with either therapy-related acute myeloid leukemia (t-AML) or therapy-related myelodysplastic syndrome (t-MDS). T-AML/t-MDS occurs in 1-10% of individuals who receive alkylator- or epipodophyllotoxin-based chemotherapy or radiation to treat a primary malignancy.⁶ For the seven individuals surveyed in this study, matched leukemia/normal whole-genome sequencing identified the t-AML/t-MDS-specific somatic mutations present at diagnosis. We applied our method for ECS to identify leukemia-specific mutations in four individuals from DNA extracted from blood and bone marrow samples collected years before diagnosis. In a separate study into the role of TP53 mutations in t-AML/t-MDS leukemogenesis, this method was used to identify leukemia-associated mutations at low frequency in samples banked years before diagnosis.⁷ In two cases, subclones were identified below the 1% threshold of detection governed by conventional NGS. These results highlight the ability of targeted ECS to identify clinically silent single-nucleotide variations (SNVs).

We employed ECS by tagging individual DNA molecules with adapters containing 16 bp random oligonucleotide molecular indexes in a manner similar to other reports.^(4,6,8) Our implementation of ECS easily targets loci of interest by single or multiplex PCR and inserts seamlessly into the standard NGS library preparation (FIG. 2, Methods for Example 1). Our deviations from the standard protocol are ligation of customized adapters containing random indexes instead of the manufacturer's supplied adapters and a quantitative PCR (qPCR) quantification step before sequencing (Table 2). Following sequencing, sequence reads containing the same index and originating from the same molecule are grouped into read families. Sequencing errors are identified by comparing reads within a read family and removed to create an error-corrected consensus sequence (ECCS). We performed a dilution series experiment to assess bias during library preparation and determine the limit of detection for ECS. For this experiment, we spiked DNA from a t-AML sample into control human DNA, which was serially diluted over five orders of magnitude. The experiment comprised two technical replicates targeting two separate mutations (20 total independent libraries). The results demonstrate that ECS is quantitative to a VAF of 1:10,000 molecules and provides a highly reproducible digital readout of tumor DNA prevalence in a heterogeneous DNA sample (r² of 0.9999 and 0.9991, FIG. 1A, FIG. 1B). We next characterized the error profile based on the wild-type nucleotides included in the dilution series experiment. Variant identification using the ECCSs was 99% specific at a VAF of 0.0016 versus 0.0140 for deep sequencing alone (FIG. 1C). We noticed that ECCS errors were heavily biased towards G to T transversions and to a lesser degree C to T transitions (FIG. 1D, FIG. 3), as previously observed.^(4,9) When separated by substitution type, variants identified from the ECCSs were 99% specific at a VAF of 0.0034 for G to T (C to A) mutations, 0.00020 for C to T (G to A) mutations and 0.000079 for the other eight possible substitutions. Although excess G to T mutations are a known consequence of DNA oxidation leading to 8-oxo-guanine conversion,⁴ the pretreatment of samples with formamidopyrimidine-DNA glycosylase before PCR amplification did not appreciably improve the error profile of G to T mutations (FIG. 4).

As proof of principle, we applied ECS to study rare pre-leukemic clonal hematopoiesis in seven individuals who later developed t-AML/t-MDS. Leukemia/normal whole-genome sequencing at diagnosis was used to identify the leukemia-specific somatic mutations in each patient's malignancy (Table 3). We applied targeted ECS to query these 18 different loci in 10 cryopreserved or formalin-fixed paraffin-embedded blood and bone marrow samples that were 9-22 years old and banked up to 12 years before diagnosis (Table 4).

We generated ~25 Gb of 150 bp paired-end reads from six Illumina (San Diego, Calif., USA) MiSeq runs. We targeted 1-7 somatic mutations per individual (25 mutations spanning 5.5 kb from 15 genes in total) and identified leukemia-specific subclonal populations in four individuals up to 12 years before diagnosis (Table 1). For each sequencing library, we tagged ~2.5 million locus-specific amplicons generated from genomic DNA using high-fidelity PCR with randomly indexed custom adapters. Sequencing errors were removed to create ECCSs as described above. Each ECCS was then aligned to the reference genome for variant calling (FIG. 2).

Using conventional deep sequencing, we detected t-AML/t-MDS-specific mutations in prior banked samples at variant allele fractions between 0.03 and 0.87 (data not shown). In one individual (UPN 684949), deep sequencing alone was insufficient to distinguish known ASXL1 and U2AF1 mutations from the sequencing errors in samples banked 5 and 3 years before t-MDS diagnosis, respectively (FIG. 1E, FIG. 1F). However, ECS identified the L866* nonsense mutation in ASXL1 at a VAF of 0.004 (FIG. 1G) and the S34Y missense mutation in U2AF1 at a VAF of 0.009 (FIG. 1H). In addition, ECS was able to temporally quantify these mutations from three pre-t-MDS samples banked yearly from 3 to 5 years before diagnosis (FIG. 5, FIG. 6). In two cases (UPN 643006 and UPN 942008), only a subset of the variants identified at diagnosis were present in the prior banked sample (Table 1). Specifically, in the UPN 643006 sample, banked 12 years before diagnosis, a single-nucleotide deletion in ASXL1 was present at VAF 0.03. However, the G to T substitution in ASXL1, the CTT deletion in GATA2 and the G to T substitution in U2AF1 were not detectable in this prior banked sample.

Here we present a practical and clinically oriented application for targeted error-corrected NGS utilizing single molecule indexing. This method easily integrates into existing NGS library preparation protocols and enables the quantification of previously undetectable mutations in heterogeneous DNA samples. A modification to the standard NGS library preparation is the replacement of the stock adapters with our randomly indexed adapters and the addition of a qPCR step before sequencing. The qPCR step limits the number of molecules sequenced, ensuring adequate coverage for each read family. With these two modifications, we achieve highly specific detection for rare mutations. The bioinformatics analysis is straightforward and does not require proprietary algorithms or tools (Methods for Example 1). Our results highlight the ability of this method to identify rare subclonal populations in a heterogeneous biological sample. As applied to t-AML/t-MDS, we show these previously undetectable mutations are present years before diagnosis and fluctuate in prevalence over time.

A clinical application of ECS is to quantify minimal residual disease (MRD). As the genomic characterization of leukemia becomes more readily available, identifying causative genetic lesions and rare therapy-resistant subclones will become increasingly useful for risk stratification, therapeutic selection and disease monitoring. Already, whole-genome sequencing of AML has demonstrated that nearly every case of AML harbors one or more somatic SNVs.¹⁰ These SNVs are more reliable clonal markers of malignancy than cell surface markers, which can change over time. Leveraging this information, conventional NGS was implemented retrospectively to detect MRD harboring leukemia-specific insertions/deletions (indels) as rare as 0.00001 VAF in NPM1¹¹ and 0.0001 VAF in RUNX1.¹² This was possible because indels are only rarely generated erroneously by NGS. Unfortunately, measuring rare leukemia-associated substitutions is limited owing to the relatively high error profile of conventional NGS.¹³ However, ECS can achieve the 1:10,000 limit of detection featured by conventional MRD platforms.¹⁴ For patients whose leukemia lacks suitable markers for conventional MRD, ECS could offer an alternative with comparable sensitivity and specificity that is easy to implement in a clinical sequencing lab. Furthermore, the ability to multiplex targets for ECS enables the surveillance of known mutations and the simultaneous discovery of new somatic mutations. Ongoing work will directly compare gold-standard MRD methods with targeted ECS in patients with and without relapsed leukemia.

Methods for Example 1.

Study Design: Blood and bone marrow samples from patients treated for t-AML/t-MDS at Washington University were banked or accessed following informed consent under Human Research Protection Protocol #201011766. Patients included in this study underwent matched leukemia and non-cancer (skin) whole genome sequencing on the Illumina HiSeq 2500 platform, which identified tumor-specific somatic coding mutations in leukemia samples. Our study focused on identifying these known mutations from matched blood or bone marrow samples banked 1-12 years prior to the initial diagnosis of t-AML/t-MDS.

Sample Preparation: Genomic DNA was generated from either FFPE or cryopreserved peripheral blood or bone marrow samples using the QIAamp DNA FFPE Tissue or DNA Mini Kit (Qiagen). PCR primers were designed using primer3¹ to amplify regions harboring individual leukemia-specific mutations from the banked biological samples (Table 5). The concentration of each purified DNA sample was determined using the Qubit dsDNA HS Assay Kit (Life Technologies). Genomic DNA (400-800 ng) was amplified using the Q5 High-Fidelity 2X Master Mix (New England Biolabs) in a 25 μL reaction with 0.5 μM primers (FIG. 2A). The following conditions were used: 98° C. for 30 s; 16-30 cycles of 98° C. for 10 s, 62-72° C. (based on a separate optimization) for 30 s and 72° C. for 30 s; 72° C. for 2 min; hold at 10° C. The PCR reactions were purified using the Agencourt AMPure XP (Beckman Coulter) bead-based protocol without modification.

For a few of the patient samples, the amount of input genomic DNA was limited. In these cases, modifications were made to the protocol to amplify multiple leukemia-specific mutations from the same biological sample (multiplex PCR). Patient-specific primers were pooled during a first round of PCR and amplified for roughly 16 cycles, similar to the pre-amplification described in TAm-Seq². After purification, the DNA was split into a single PCR reaction per patient-specific SNV and amplified using only that specific primer pair, again for roughly 16 cycles. This allowed us to generate diverse amplicon pools for multiple loci using only 400-800 ng of starting DNA.

ECS Library Preparation: The concentration of the purified PCR products was measured using the Qubit dsDNA HS Assay Kit (Life Technologies). NGS libraries were prepared from 800 ng of amplicons for each sample/mutation using the Illumina TruSeq DNA Sample Preparation Kit (Illumina). We replaced the Illumina-provided Y-shaped adapters with custom adapters containing a random 16 base pair oligonucleotide index sequence (Table 2). Adapters were diluted to 40 μM in Tris-EDTA with 5 nM NaCl and annealed using the following conditions: 95° C. for 5 min, then decreased by 1° C. every 30 s to 4° C. Aside from the custom adapters used for ligation, the library preparation protocol from Illumina was mostly unchanged (FIG. 2B). Enrichment for correctly ligated products was completed using a 50 μL Q5 PCR amplification with 2 μL of ligation product and 0.5 μM Illumina-specific primers under the following conditions: 98° C. for 30 s; 6 cycles of 98° C. for 10 s, 57° C. for 30 s and 72° C. for 30 s; 72° C. for 2 min; hold at 10° C. The PCR reaction was purified using a modified Ampure bead cleanup, which increased the size range of purification to remove adapter dimers. 100 μL of beads were washed twice with ddH2O to remove the stock polyethylene glycol (PEG) solution. The solution was replaced with 25.5 μL 50% wt/vol PEG (Sigma), 37.5 μL 5 M NaCl and 37 μL ddH2O. The PCR reaction was added to this solution and purified per the standard Ampure protocol.

Quantification by qPCR: We sought to generate read families from a single randomly-indexed molecule with roughly seven-fold coverage. Given that the bandwidth of a single Illumina MiSeq run was roughly 15-18 million read pairs, we sought to generate sequencing libraries from roughly 2.5 million molecules. To achieve this, we quantified the concentration of each library using the qPCR NGS Library Quantification Kit, Illumina GA (Agilent Technologies). Based on the measured concentration, each library was diluted to 0.4 pM such that a 10 μL volume of the diluted library would contain ~2.5 million molecules. The 10 μL aliquot of diluted sequencing library was then amplified for 16-20 cycles and purified with the same Q5 and modified Ampure bead protocol used for the previous enrichment PCR step. The final library was visualized on a 2% SYBR Safe gel (Life Technologies) and quantified using the Qubit dsDNA HS Assay Kit. When multiplexing samples on a single lane of sequencing, individual sequencing libraries were combined in equimolar amounts after enrichment PCR and the pooled sample was diluted and quantified using qPCR as stated previously. However, we also found it possible to pool amplicons in equimolar amounts after the initial genomic DNA amplification and make a single sequencing library. Up to 7 different amplicons were multiplexed on a single MiSeq run. Multiplexing was only possible with mutations in different genes or within different exons of the same gene because the samples were demultiplexed by alignment.
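The dilution arithmetic in this step can be checked with a brief Python sketch. The function names are hypothetical and the calculation simply converts a molar concentration and volume into a molecule count using Avogadro's number, then divides the stated run capacity by the number of input molecules; it is an illustrative check, not part of the described protocol.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def molecules_in_aliquot(concentration_pM, volume_uL):
    """Number of molecules in an aliquot of given picomolar concentration."""
    moles = (concentration_pM * 1e-12) * (volume_uL * 1e-6)  # mol/L x L
    return moles * AVOGADRO


def mean_reads_per_family(read_pairs, molecules):
    """Average read pairs available per uniquely indexed molecule."""
    return read_pairs / molecules


aliquot_molecules = molecules_in_aliquot(0.4, 10)     # about 2.4 million
low_coverage = mean_reads_per_family(15e6, 2.5e6)     # 15 million read pairs
high_coverage = mean_reads_per_family(18e6, 2.5e6)    # 18 million read pairs
```

Under these assumptions, a 10 μL aliquot at 0.4 pM holds roughly 2.4 million molecules, and 15-18 million read pairs spread over 2.5 million molecules give about six- to seven-fold coverage per read family, in line with the figures stated above.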

Sequencing: Each library was sequenced on the Illumina MiSeq instrument as specified by the manufacturer (FIG. 2C). Approximately 5-10% of PhiX control DNA was spiked into each sequencing experiment. Each completed sequencing run contained roughly 15-18M paired-end 150 bp reads. Raw sequence reads were aligned to the PhiX genome using Bowtie 2³. Sequence reads aligning to PhiX were removed from further analysis. The remaining sequence reads were aligned to UCSC hg19/GRCh37 using Bowtie 2 for comparison against error-corrected consensus sequences (ECCS) derived from read families (below).

Error Corrected Consensus Sequences: Sequence reads containing the same index sequence (originating from the same randomly-indexed molecule) were aligned to each other to generate read families in a fashion similar to previously published methods^(4,5) (FIG. 2D). Previous studies used a minimum read family size of three. We found that using a more stringent cutoff of five reduced the error rate in the read families (FIG. 7). The median read family size was seven reads per index (FIG. 8). Paired-end reads within a read family were error corrected in a stepwise fashion (FIG. 2E). First, at every position, the nucleotides called by each sequence read were compared and a consensus nucleotide was called if there was at least 90% agreement between the reads. If there was less than 90% agreement, an N was called in the consensus sequence at that position. Errors that occurred during library preparation and sequencing were removed because they were not shared between different reads within a read family. Second, an ECCS was discarded if less than 90% of the 300 nucleotides comprising the paired-end read were assigned a non-N nucleotide. These ECCSs were locally aligned to UCSC hg19/GRCh37 using Bowtie 2³ (FIG. 2F). The aligned ECCSs were processed with Mpileup⁶ using the parameters -BQ0 -d 10000000000000. This removed the coverage thresholds to ensure that all of the pileup output was returned regardless of variant allele fraction (VAF) or coverage. Variant allele fractions comprising both the expected mutations and the background errors for each sample were visualized using IGV⁷ and graphically represented using ggplot2⁸. Each known variant was plotted relative to the error profile of that specific substitution class (e.g. an expected C to T transition was compared against the C to T error profile). Variants distinguishable from the noise for that specific error class and located at the expected position within the amplicon were called true positives. The threshold for calling true variants varied based on the error profile of that substitution class. Based on our benchmarking studies we were 99% specific to detect variants above 0.0034 VAF for G to T (C to A) substitutions, 0.00020 VAF for C to T (G to A) substitutions and 0.000079 VAF for the other eight possible substitutions.
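The substitution-class thresholds stated above can be expressed as a small lookup, sketched here in Python. The function and constant names are hypothetical, and the sketch covers only the VAF comparison; the additional requirement that a variant lie at the expected position within the amplicon is not modeled.

```python
# 99%-specificity VAF thresholds per substitution class, as stated in the text.
VAF_THRESHOLDS = {
    "G>T": 0.0034,     # G to T (C to A)
    "C>T": 0.00020,    # C to T (G to A)
    "other": 0.000079, # the other eight possible substitutions
}


def substitution_class(ref, alt):
    """Collapse a substitution and its reverse complement into one class."""
    change = f"{ref.upper()}>{alt.upper()}"
    if change in ("G>T", "C>A"):
        return "G>T"
    if change in ("C>T", "G>A"):
        return "C>T"
    return "other"


def exceeds_threshold(ref, alt, vaf):
    """True if the observed VAF is above the threshold for its class."""
    return vaf > VAF_THRESHOLDS[substitution_class(ref, alt)]


asxl1_call = exceeds_threshold("T", "G", 0.0042)   # T to G, "other" class
u2af1_low = exceeds_threshold("G", "T", 0.0002)    # G to T at low VAF
u2af1_high = exceeds_threshold("G", "T", 0.0089)   # G to T at higher VAF
```

Using VAF values reported in Table 1, a T to G variant at 0.0042 and a G to T variant at 0.0089 exceed their class thresholds, whereas a G to T variant at 0.0002 falls below the 0.0034 threshold for that error-prone class.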

TABLE 1 Patient-specific leukemia-associated somatic mutations identified by ECS.

| UPN | Sample ID | Years prior | Gene | Chr | Position | Mut | Amino-acid change | Variant RFs | Reference RFs | VAF |
|---|---|---|---|---|---|---|---|---|---|---|
| 446294 | 75.02 | 1 | OBSCN | 1 | 228461129 | A to G | H1857R | 61 238 | 156 986 | 0.2806 |
| | | | TP53 | 17 | 7578271 | T to A | H193L | 220 551 | 110 047 | 0.6671 |
| 499258 | 24.06 | 2 | RUNX1 | 21 | 36252865 | C to G | R139P | 2 | 486 196 | 0 |
| 574214 | 26.04 | 7 | DMD | X | 32827676 | G to A | R187* | 7 | 199 945 | 0 |
| 643006 | 80.01 | 12 | ASXL1 | 20 | 31022448 | G to T | G645C | 7 | 85 781 | 0.0001 |
| | | | ASXL1 | 20 | 31022442 | del G | G645fs | 2 898 | 82 245 | 0.034 |
| | | | GATA2 | 3 | 128200135 | del CTT | K390in_fr_del | 0 | 4 187 | 0 |
| | | | U2AF1 | 21 | 44524456 | G to T | S34Y | 85 | 414 613 | 0.0002 |
| 684949 | 91.01 | 5 | ASXL1 | 20 | 31023112 | T to G | L866* | 3 583 | 853 598 | 0.0042 |
| | | | U2AF1 | 21 | 44524456 | G to T | S34Y | 545 | 514 410 | 0.0011 |
| | 92.02 | 4 | ASXL1 | 20 | 31023112 | T to G | L866* | 54 074 | 535 976 | 0.0916 |
| | | | U2AF1 | 21 | 44524456 | G to T | S34Y | 11 195 | 355 276 | 0.0305 |
| | 93.01 | 3 | ASXL1 | 20 | 31023112 | T to G | L866* | 17 319 | 573 629 | 0.0293 |
| | | | U2AF1 | 21 | 44524456 | G to T | S34Y | 827 | 92 104 | 0.0089 |
| 856024 | 30.02 | 1 | S100A4 | 1 | 153517192 | A to G | F27L | 0 | 211 512 | 0 |
| | | | IGSF8 | 1 | 160062252 | G to A | P516S | 0 | 22 614 | 0 |
| | | | PLA2R1 | 2 | 160798389 | A to G | L1431P | 2 | 338 616 | 0 |
| | | | POU3F2 | 6 | 99282794 | C to A | S15R | 8 | 201 240 | 0 |
| | | | ANKRD18B | 9 | 33524645 | G to A | C53Y | 7 | 214 836 | 0 |
| | | | ESR2 | 14 | 64701847 | G to A | A416V | 10 | 135 861 | 0.0001 |
| | | | FBN3 | 19 | 8155081 | G to A | P2029L | 0 | 152 304 | 0 |
| 942008 | 33.04 | 9 | IDH2 | 15 | 90631934 | C to T | R88Q | 23 170 | 236 587 | 0.0892 |
| | | | RUNX1 | 21 | 36231791 | T to C | D171G | 40 | 253 168 | 0.0002 |
| | 107.01 | <1 | IDH2 | 15 | 90631934 | C to T | R88Q | 138 180 | 161 371 | 0.4613 |
| | | | RUNX1 | 21 | 36231791 | T to C | D171G | 368 438 | 50 796 | 0.8788 |

Abbreviations: ECS, error-corrected sequencing; RFs, read families; VAF, variant allele fraction. Two to seven mutations were queried per individual and the number of read families containing the variant allele or reference allele were reported and used to calculate the variant allele fraction.
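The variant allele fractions in Table 1 follow from the read-family counts, as the table note indicates. A minimal Python sketch of that calculation is given below; the function name is hypothetical, and the formula (variant read families divided by the sum of variant and reference read families) is inferred from the table note and is consistent with the tabulated values.

```python
def variant_allele_fraction(variant_rfs, reference_rfs):
    """VAF = variant read families / (variant + reference read families)."""
    total = variant_rfs + reference_rfs
    if total == 0:
        raise ValueError("no read families cover this position")
    return variant_rfs / total


# Read-family counts taken from Table 1.
obscn_vaf = variant_allele_fraction(61238, 156986)   # UPN 446294, OBSCN
asxl1_vaf = variant_allele_fraction(3583, 853598)    # UPN 684949, sample 91.01, ASXL1
```

Rounded to four decimal places, these two values reproduce the 0.2806 and 0.0042 entries in the table.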

TABLE 2 Random 16-mer molecular indexed adapters. The terminal 5-prime phosphorylation on the complementary adapter sequence was used to improve ligation efficiency (*).

Label                  Sequence                                                            SEQ ID NO:
16N Index Adapter      AGACGGCATACGAGATNNNNNNNNNNNNNNNNGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT  1
Complementary Adapter  *GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT          2

TABLE 3 Whole-genome sequencing of diagnosis t-AML/t-MDS samples.

UPN     Gene      Chr  Position   Mutation  AA Change        Reference Reads  Variant Reads  VAF
446294  OBSCN     1    228461129  A to G    H1857R           3                5              0.63
        TP53      17   7578271    T to A    H193L            79               106            0.57
499258  RUNX1     21   36252865   C to G    R139P            122              17             0.12
574214  DMD       X    32827676   G to A    R187*            103              73             0.41
643006  ASXL1     20   31022448   G to T    G645C            36               32             0.47
        ASXL1     20   31022442   del G     G645fs           33               32             0.49
        GATA2     3    128200135  del CTT   K390in_frame_de  8                10             0.56
        U2AF1     21   44524456   G to T    S34Y             24               27             0.53
684949  ASXL1     20   31023112   T to G    L866*            75               14             0.16
        U2AF1     21   44524456   G to T    S34Y             57               9              0.14
856024  S100A4    1    153517192  A to G    F27L             103              48             0.32
        IGSF8     1    160062252  G to A    P516S            28               42             0.60
        PLA2R1    2    160798389  A to G    L1431P           45               33             0.42
        POU3F2    6    99282794   C to A    S15R             15               15             0.50
        ANKRD18   9    33524645   G to A    C53Y             26               20             0.43
        ESR2      14   64701847   G to A    A416V            40               22             0.35
        FBN3      19   8155081    G to A    P2029L           54               38             0.41
942008  IDH2      15   90631934   C to T    R88Q             10               10             0.50
        RUNX1     21   36231791   T to C    D171G            15               35             0.70

TABLE 4 Summary of patient information. The type of primary malignancy, the date of primary malignancy diagnosis, the date and type of blood/bone marrow banked prior to t-AML/t-MDS diagnosis and the date of t-AML/t-MDS diagnosis are included in the table below. At t-AML/t-MDS diagnosis, tumor/normal whole genome sequencing identified leukemia-specific mutations. Some of the prior banked blood/bone marrow samples showed evidence of subclonal populations harboring those leukemia-specific mutations before the clinical detection of disease.

UPN     Primary Malignancy Diagnosis  Date Primary Malignancy  Banked Samples  Banking Type  Date Banked  t-AML/t-MDS Diagnosis  Evidence of Pre-Leukemic Subclones
446294  Breast cancer                 2002                     75.02           FFPE          07/2005      2006 (t-MDS)           Yes
499258  Hodgkin's                     1998                     24.06           Cryo          02/2002      2004 (t-MDS)           No
574214  Breast cancer                 1998                     26.04           Cryo          01/2000      2007 (t-MDS)           No
643006  AML                           1989                     80.01           FFPE          04/1992      2004 (t-MDS)           Yes
684949  CLL                           09/1991                  91.01           FFPE          11/2002      2007 (t-MDS)           Yes
                                                               92.02           FFPE          09/2003                             Yes
                                                               93.01           FFPE          10/2004                             Yes
856024  NHL                           11/2004                  30.02           Cryo          03/2005      2006 (t-AML)           No
942008  NHL                           08/1992                  33.04           Cryo          09/1996      2005 (t-AML)           Yes
                                                               107.01          FFPE          11/2005                             Yes

TABLE 5 Primers targeting leukemia-specific variants. Primer sequences used to generate variant-specific amplicons from banked genomic DNA samples.

UPN     Gene      FWD Primer                           SEQ ID NO:  Reverse Primer                  SEQ ID NO:
446294  OBSCN     GGAGCCTCTGACCCTGCATCCCTCC            3           CCCGCCTCACAGCTGTACTCCCCAG       4
        TP53      AGACCTCAGGCGGCTCATAGGGCAC            5           GGGGCTGGAGAGACGACAGGGCTG        6
499258  RUNX1     TCACTAGAATTTTGAAATGTGGGTTTGTTGCC     7           GCACTCTGGTCACTGTGATGGCTGGC      8
574214  DMD       GGCGATGTTGAATGCATGTTCCAGT            9           AGGACTATGGGCATTGGTTGTCAAT       10
643006  ASXL1     GGACCCTCGCAGACATTAAAGCCCGT           11          GCCTCACCACCATCACCACTGCTGC       12
        GATA2     CCACAGGTGCCATGTGTCCAGCCAG            13          CTGTGGCGGGGTGGGAGGAATGTTG       14
        U2AF1     TGAACACAAATGGAAAATACAACTACGAGAGAAAA  15          CCCAGCAAAATAATCAGCTCTCATTTTCCC  16
684949  ASXL1     CACTATGAAGGATCCTGTAAATGTGACCCC       17          TGGTTTGGGCTGTTTCACTACCTCA       18
        U2AF1     TGAACACAAATGGAAAATACAACTACGAGAGAAAA  15          CCCAGCAAAATAATCAGCTCTCATTTTCCC  16
856024  S100A4    CCACGTGGGGACTCACTCAGGCA              19          AATAAGACGGTCTCTGTGCCTCCTG       20
        IGSF8     TGGTACACGCCTTCATCCTCGGG              21          GCTCAGCTCTGTCCCTGCCCAGCT        22
        PLA2R1    ACCCTGGTGTCTGTGGCATTCTCTG            23          AGTCACAGCATCATTCCTCTTGCGGT      24
        POU3F2    CAAATGCGCGGCTCCTTTAACCGGA            25          GCGTGGCTGAGCGGGTGTCC            26
        ANKRD18B  TACCACATTCGGGACTGGGAACTGC            27          CTCCCAGGGTCCCGGCGAACTCC         28
        ESR2      TGGCAATCACCCAAACCAAAGCATCGGT         29          AACCCAGATCACCTCGGAGCAGGCG       30
        FBN3      GGGGACACAGTTCGCAGGGGTC               31          GACTGGGGTGCGGGAGGTCACAGG        32
942008  IDH2      GGCGTGCCTGCCAATGGTGATGGG             33          CCGTCTGGCTGTGTTGTTGCTTGGGG      34
        RUNX1     ACATGGTCCCTGAGTATACCAGCCT            35          GGCCACCAACCTCATTCTGTTTTGT       36

References for Example 1

-   1 Holstege H, Pfeiffer W, Sie D, Hulsman M, Nicholas T J, Lee C C et al. Somatic mutations found in the healthy blood compartment of a 115-yr-old woman demonstrate oligoclonal hematopoiesis. Genome Res 2014; 24: 733-742.
-   2 Walter M J, Shen D, Ding L, Shao J, Koboldt D C, Chen K et al. Clonal architecture of secondary acute myeloid leukemia. N Engl J Med 2012; 366: 1090-1098.
-   3 Welch J S, Ley T J, Link D C, Miller C A, Larson D E, Koboldt D C et al. The Origin and Evolution of Mutations in Acute Myeloid Leukemia. Cell 2012; 150: 264-278.
-   4 Schmitt M W, Kennedy S R, Salk J J, Fox E J, Hiatt J B, Loeb L A. Detection of ultra-rare mutations by next-generation sequencing. Proc Natl Acad Sci USA 2012; 109: 14508-14513.
-   5 Kinde I, Wu J, Papadopoulos N, Kinzler K W, Vogelstein B. Detection and quantification of rare mutations with massively parallel sequencing. Proc Natl Acad Sci USA 2011; 108: 9530-9535.
-   6 Godley L A, Larson R A. Therapy-related myeloid leukemia. Semin Oncol 2008; 35: 418-429.
-   7 Wong T, Ramsingh G, Young A L, Miller C A, Touma W, Welch J S et al. The role of TP53 mutations in the origin and evolution of therapy-related AML. Nature 2015; 518: 552-555.
-   8 Fu G K, Xu W, Wilhelmy J, Mindrinos M N, Davis R W, Xiao W et al. Molecular indexing enables quantitative targeted RNA sequencing and reveals poor efficiencies in standard library preparations. Proc Natl Acad Sci USA 2014; 111: 1891-1896.
-   9 Lou D I, Hussmann J A, McBee R M, Acevedo A, Andino R, Press W H et al. High-throughput DNA sequencing errors are reduced by orders of magnitude using circle sequencing. Proc Natl Acad Sci USA 2013; 110: 19872-19877.
-   10 Cancer Genome Atlas Research Network. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N Engl J Med 2013; 368: 2059-2074.
-   11 Salipante S J, Fromm J R, Shendure J, Wood B L, Wu D. Detection of minimal residual disease in NPM1-mutated acute myeloid leukemia by next-generation sequencing. Mod Pathol 2014; 27: 1438-1446.
-   12 Kohlmann A, Nadarajah N, Alpermann T, Grossmann V, Schindela S, Dicker F et al. Monitoring of residual disease by next-generation deep-sequencing of RUNX1 mutations can identify acute myeloid leukemia patients with resistant disease. Leukemia 2014; 28: 129-137.
-   13 Loman N J, Misra R V, Dallman T J, Constantinidou C, Gharbia S E, Wain J et al. Performance comparison of benchtop high-throughput sequencing platforms. Nat Biotechnol 2012; 30: 434-439.
-   14 Hourigan C S, Karp J E. Minimal residual disease in acute myeloid leukaemia. Nat Rev Clin Oncol 2013; 10: 460-471.

References for the Methods for Example 1

-   1 Untergasser A, Cutcutache I, Koressaar T, Ye J, Faircloth B C, Remm M et al. Primer3—new capabilities and interfaces. Nucleic Acids Res 2012; 40: e115.
-   2 Forshew T, Murtaza M, Parkinson C, Gale D, Tsui D W Y, Kaper F et al. Noninvasive identification and monitoring of cancer mutations by targeted deep sequencing of plasma DNA. Sci Transl Med 2012; 4: 136ra68.
-   3 Langmead B, Salzberg S L. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012; 9: 357-9.
-   4 Kinde I, Wu J, Papadopoulos N, Kinzler K W, Vogelstein B. Detection and quantification of rare mutations with massively parallel sequencing. Proc Natl Acad Sci USA 2011; 108: 9530-5.
-   5 Schmitt M W, Kennedy S R, Salk J J, Fox E J, Hiatt J B, Loeb L A. Detection of ultra-rare mutations by next-generation sequencing. Proc Natl Acad Sci USA 2012; 109: 14508-13.
-   6 Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009; 25: 2078-9.
-   7 Thorvaldsdóttir H, Robinson J T, Mesirov J P. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform 2013; 14: 178-92.
-   8 Wickham H. ggplot2. Springer New York: New York, N.Y., 2009 doi:10.1007/978-0-387-98141-3.

Example 2. MRD Testing in AML Using Error-Corrected Sequencing

In acute myeloid leukemia (AML), minimal residual disease (MRD) testing following treatment is accomplished using multiparameter flow cytometry, which targets clonal cell surface markers, or qPCR, which targets leukemia-associated chromosomal translocations. While both methods provide prognostic information to a detection limit of 1:10,000 cells, they are useful in only a subset of leukemia patients¹⁻³. Conversely, leukemia-specific somatic mutations occur in virtually every case of AML and present a potential target for residual disease assessment^(4,6). Our goal is to develop a sequencing-based platform to detect rare leukemic cells by their unique somatic mutation profile. Currently, next-generation sequencing is not sensitive enough to detect rare somatic mutations due to a 1% error rate. Fortunately, we have adapted methods for error-corrected sequencing (ECS) to circumvent this limitation⁶⁻⁹. Here, we have extended these methods for ECS with leukemia-specific genomic DNA capture to attempt to detect rare (<1%) persistent leukemic cells regardless of their specific somatic mutation profile.

Remission blood samples from 15 individuals treated for de novo AML were acquired. Subclonal somatic mutations in the remission samples were then identified. These results were used to quantify the burden of persistent leukemia. The results obtained were compared to conventional NGS and clinical findings.

To facilitate leukemia-specific capture, the Illumina TruSight Myeloid Panel was used. The Panel captures 54 genes frequently mutated in AML via 568 amplicons and targets 141 kb of genomic DNA (Table 6). The Panel method is depicted in FIG. 9A and FIG. 9B.

Future directions involve further development of the TruSight capture/ECS protocol, assessment of persistent AML following treatment using leukemia-specific somatic mutations, and assessment of the role of rare subclones arising in the hematopoietic compartment of healthy individuals.

TABLE 6 Coverage Details

Cumulative target region size  ~141 kb
Number of target genes         54
Amplicon size                  ~250 bp
Number of amplicons            568
Recommended mean coverage      5,000x
Target minimum coverage        500x
Percent exons covered at 500x  95

Example 3. Use of TruSight Myeloid Panel and ECS Protocol in a Clinical Study of Healthy Individuals

In collaboration with the Nurses' Health Study, 20 healthy elderly individuals were enrolled to examine the clinical possibilities of the TruSight Myeloid panel and ECS methodology. Paired buffy coat samples were banked 10 years apart. The average age at collection of the first sample was 57.1 years and the average age at collection of the second sample was 68.5 years. Samples were prepared in duplicate (80 libraries total) using the Illumina TruSight Myeloid panel and the ECS protocol. The samples were sequenced on 10 NextSeq High Output (PE150) runs. Table 7 presents a summary of the sequencing results. The output per run was ~400M PE reads. Table 8 shows that the libraries appear to be mixed in equimolar ratios. There are approximately 3M read families per library. FIG. 10 shows that the amplicon coverage between replicates is correlated. FIG. 10A shows that two libraries sequenced on the same run (NHS1) had an R-squared value of 0.9718 and FIG. 10B shows that two libraries sequenced on different runs (NHS2, NHS6) had an R-squared value of 0.7536. FIG. 11, which presents data from DNMT3A, shows that the coverage per amplicon is variable.

TABLE 7 Summary of Sequencing Results.

Library  Sequenced Reads
NHS1     364,776,941
NHS2     331,997,319
NHS3     361,510,360
NHS4     387,756,648
NHS5     468,765,873
NHS6     433,606,686
NHS7     435,037,421
NHS8     516,437,915
NHS9     519,524,765
NHS10    495,292,729

TABLE 8 Summary of Sequencing Results.

Library  Index     Demux Reads  Frac of Total  ECS RFs    Frac of Demux Reads
NHS1     CACCACAC  33,657,209   0.092          2,860,180  0.085
NHS1     ACAGTGGT  33,448,218   0.092          2,804,567  0.084
NHS1     ACAAACGG  33,392,978   0.092          2,784,229  0.083
NHS1     ACCCAGCA  39,268,072   0.108          3,318,173  0.085
NHS1     ATCACGAC  38,483,255   0.105          3,333,626  0.087
NHS1     CCCAACCT  35,157,811   0.096          3,027,607  0.086
NHS1     AACCCCTC  41,984,812   0.115          3,558,516  0.085
NHS1     CAGATCCA  35,427,539   0.097          2,987,274  0.084

Subclonal single nucleotide variations (SNVs) were then called based on the TruSight-ECS libraries generated. A position-specific binomial error model was used to identify rare subclonal SNVs. For each sample, we generated a position-specific error profile from all of the sequenced libraries in the study except for samples sequenced from the same individual (the other replicate from the same time point and both replicates from the other time point). Variants were reported if their binomial p-value was less than 0.05 after Bonferroni correction, the variant was observed in at least 5 ECCSs, the VAF was greater than 0.0001, and the variant was identified in at least two replicates from one of the collection time points. The identification of constitutional and rare SNVs in the samples is presented in FIG. 12. Forty-nine germline SNVs were identified (0.5/1.0 VAF), 5 high VAF were identified (0.14-0.36 VAF) and 106 low VAF were identified (<0.1 VAF) for a total of 160 SNVs detected. Additionally, rare subclones were detected longitudinally in NHS participants (FIG. 13). FIG. 14 presents the total rare subclonal variants detected per individual. The majority of SNVs were present in the exonic regions.

The subclonal variants were then classified. As shown in FIG. 15A, the majority of rare variants were present in the exonic regions followed by the intronic regions. Rare variants were occasionally found in ncRNA, splicing region and UTR3. FIG. 15B shows that the vast majority of rare variants were nonsynonymous SNVs. Additionally, detected exonic variants clustered in the DNMT3A and TET2 genes (FIG. 16). The intronic variants, in contrast, were more evenly distributed (FIG. 17). Notably, variants were not exclusively called in highly covered amplicons. FIG. 18A shows the histogram coverage per amplicon and FIG. 18B shows the histogram coverage per amplicon with variants called. It is also important to note that the target space per gene does not correlate with the SNV calls per gene. FIG. 19A shows that the exonic mutations were distributed throughout the target space in the gene and FIG. 19B shows that the intronic mutations were also distributed throughout the target space in the gene. FIG. 20 presents the distribution of exonic mutations by gene, which differed depending on the gene. For DNMT3A, mutations were only nonsynonymous or stopgain. For TET2 and BCORL1, mutations were nonsynonymous, stopgain and synonymous.
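The reporting criteria for rare subclonal SNVs described in this example can be sketched as follows. This is a hedged illustration rather than the pipeline actually used: the function names, the way the background error rate is supplied, and the number of tests used for the Bonferroni correction are all assumptions. Only the four reporting criteria (corrected p-value below 0.05, at least 5 ECCSs, VAF above 0.0001, at least two replicates from one time point) are taken from the text.

```python
import math


def log_binom_pmf(k, n, p):
    """Log of the Binomial(n, p) probability mass at k, for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed directly over the upper tail."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        term = math.exp(log_binom_pmf(i, n, p))
        total += term
        # Past the mean the terms only shrink, so stop once they are negligible.
        if i > n * p and term < total * 1e-16:
            break
    return min(total, 1.0)


def report_variant(variant_eccs, total_eccs, error_rate, n_tests,
                   replicates_with_variant, alpha=0.05, min_eccs=5,
                   min_vaf=0.0001, min_replicates=2):
    """Apply the four reporting criteria to one candidate variant.

    error_rate: background error rate at this position, assumed to be
        estimated from libraries of other individuals (hypothetical input).
    n_tests: number of hypotheses for the Bonferroni correction (assumption).
    replicates_with_variant: replicates from a single collection time point
        in which the variant was seen.
    """
    if total_eccs == 0:
        return False
    vaf = variant_eccs / total_eccs
    p_value = binom_upper_tail(variant_eccs, total_eccs, error_rate)
    p_adjusted = min(1.0, p_value * n_tests)
    return (p_adjusted < alpha
            and variant_eccs >= min_eccs
            and vaf > min_vaf
            and replicates_with_variant >= min_replicates)
```

The upper tail is summed in log space so that very small p-values at deeply covered positions do not vanish through cancellation, which would happen if they were computed as one minus a cumulative probability.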

Given that mutations in DNMT3A and TET2 were the most prevalent mutations, we analyzed where each of the mutations was found on these two genes. Mapping out the spectrum of DNMT3A mutations shows the prevalence of early truncating mutations and the numerous missense mutations in the ZFN and methyltransferase domains (FIG. 21). Mapping out the spectrum of TET2 mutations shows several mutations in the oxygenase domain (FIG. 22). We then evaluated the distribution of rare subclonal mutations per person (FIG. 23). While the majority of individuals have mutations in DNMT3A, mutations in other genes were also detected in combination with the DNMT3A mutation.

In summary, we found that rare subclones harboring mutations in leukemia-associated genes are common in healthy individuals (19/20 individuals). We also found that subclones frequently harbor mutations in DNMT3A (but not R882) and TET2. Additionally, since the samples were taken about 10 years apart, we found that subclones are stable over time. Notably, the detection of subclones is not likely due to coverage or target-space bias.

Example 4. VAF Measured with ECS Correlates with VAF Measured with ddPCR

We next sought to validate the COSMIC (Catalogue of Somatic Mutations in Cancer) variants detected using ddPCR (FIG. 24). In the digital droplet validation method, 21 probes and 150,000 to 450,000 droplets per sample or control were used. We found that VAF measured by ECS is highly correlated with VAF measured by ddPCR (R²=0.98) (FIG. 25A). When focusing on the VAF of <0.01, the VAF measured by ECS still correlated with the VAF measured by ddPCR (R²=0.72) (FIG. 25B).

We then sought to identify the subclones found in various cell types. Accordingly, subclone identification using ddPCR was performed on flow sorted buffy coat samples. Samples were selected from 13 individuals and then pan-leukocyte, myeloid, B-cells and T-cells were sorted. The sorting conditions included the following: pan-leukocyte: BV421 anti-CD45; myeloid: APC anti-CD33; B-cells: FITC anti-CD19; and T-cells: PE-CY7 anti-CD3. Enough DNA was extracted from the sorted samples to perform ddPCR without amplification; however, variability in flow yield was detected (FIG. 26). We found that subclonal mutations are present in multiple cellular lineages in all tested samples (FIG. 27). These results are more apparent when focusing specifically on a VAF of <0.01 (FIG. 28).

In summary, we showed that there is a high concordance between VAF measured with ECS and VAF measured with ddPCR. Additionally, we demonstrated that subclonal mutations are present in distinct hematopoietic lineages. It was also demonstrated that subclone identification is improved with indel calling.

What is claimed is:
 1. A method of identifying a genetic mutation in a biological sample comprising nucleic acid obtained from a subject, the method comprising: a) amplifying one or more regions of interest from the biological sample comprising nucleic acid, wherein a plurality of amplicons for each region of interest are generated; b) attaching an adapter and a random component to each amplicon generated in (a) and amplifying; c) sequencing the amplicons comprising the random component generated in (b), wherein redundant reads are generated and wherein the redundant reads are grouped by the random component and a consensus sequence is identified; and d) comparing the consensus sequence to a reference sequence, wherein a consensus sequence that differs from the reference sequence comprises a genetic mutation.
 2. The method of claim 1, wherein the biological sample comprises about 400 to about 800 ng nucleic acid.
 3. The method of claim 1, wherein one region of interest is amplified in step (a).
 4. The method of claim 1, wherein more than one region of interest is amplified in step (a).
 5. The method of claim 4, wherein the regions of interest are amplified by PCR for about 15 to about 20 cycles generating a plurality of amplicons, the amplicons are separated into more than one reaction vial, one primer pair is added to each reaction vial, and the region of interest in each reaction vial and amplified for about 15 to about 20 cycles.
 6. The method of claim 1, wherein step (b) further comprises attaching an index sequence.
 7. The method of claim 6, wherein the adapter, the random component and the index sequence are attached via ligation.
 8. The method of claim 1, wherein the adapter is a Y-shaped adapter.
 9. The method of claim 1, wherein the adapter is an Illumina adapter.
 10. The method of claim 1, wherein the adaptor and the random component are attached via ligation.
 11. The method of claim 1, wherein the method identifies clinical silent single-nucleotide variations (SNVs).
 12. The method of claim 1, wherein the method identifies a genetic mutation that is at a frequency of less than 1 in 1,000 in the sample.
 13. The method of claim 1, wherein the method identifies a genetic mutation that is at a frequency of less than 1 in 5,000 in the sample.
 14. The method of claim 1, wherein the method identifies a genetic mutation that is at a frequency of less than 1 in 10,000 in the sample.
 15. A method of identifying a genetic mutation in a biological sample comprising nucleic acid obtained from a subject, the method comprising: a) hybridizing a primer pool comprising one or more primer pairs specific to one or more regions of interest from the biological sample comprising nucleic acid, extending from an upstream primer of the primer pair to a downstream primer of the primer pair, and ligating the extension product to the downstream primer of the primer pair, wherein products comprising the regions of interest flanked by sequences required for amplification are generated; b) attaching an adapter comprising a random component and attaching an adapter comprising an index sequence to the products from (a) and amplifying; c) sequencing the products comprising the random component generated in (b), wherein redundant reads are generated and wherein the redundant reads are grouped by the random component and a consensus sequence is identified; and d) comparing the consensus sequence to a reference sequence, wherein a consensus sequence that differs from the reference sequence comprises a genetic mutation.
 16. The method of claim 15, wherein the biological sample comprises about 400 to about 800 ng nucleic acid.
 17. The method of claim 15, wherein more than 500 regions of interest are sequenced.
 18. The method of claim 15, wherein unbound primers are washed away prior to proceeding to (b).
 19. The method of claim 15, wherein the adapter is a Y-shaped adapter.
 20. The method of claim 15, wherein the adapter is an Illumina adapter.
 21. The method of claim 15, wherein the adapter comprising a random component and the adaptor comprising an index sequence are attached via PCR.
 22. The method of claim 15, wherein the method identifies clinical silent single-nucleotide variations (SNVs).
 23. The method of claim 15, wherein the method identifies a genetic mutation that is at a frequency of less than 1 in 1,000 in the sample.
 24. The method of claim 15, wherein the method identifies a genetic mutation that is at a frequency of less than 1 in 5,000 in the sample.
 25. The method of claim 15, wherein the method identifies a genetic mutation that is at a frequency of less than 1 in 10,000 in the sample.
 26. A method of detecting minimal residual disease (MRD) in a subject, the method comprising: a) hybridizing a primer pool comprising one or more primer pairs specific to one or more regions of interest from a biological sample comprising nucleic acid obtained from the subject, extending from an upstream primer of the primer pair to a downstream primer of the primer pair, and ligating the extension product to the downstream primer of the primer pair, wherein products comprising the regions of interest flanked by sequences required for amplification are generated; b) attaching an adapter comprising a random component and attaching an adapter comprising an index sequence to the products from (a) and amplifying; c) sequencing the products comprising the random component generated in (b), wherein redundant reads are generated and wherein the redundant reads are grouped by the random component and a consensus sequence is identified; and d) comparing the consensus sequence to a reference sequence, wherein a consensus sequence that differs from the reference sequence comprises a genetic mutation and is indicative of MRD.
 27. The method of claim 26, wherein the subject is treated if a genetic mutation is detected.